Our contribution will be short, but we will try to 
compensate by being particularly opinionated. The 
field of support vector machines (SVMs) and related 
kernel methods has produced an impressive range of 
theoretical results, algorithms and success stories in 
real-world applications. While it originated in ma- 
chine learning, it is also concerned with core prob- 
lems of statistics and it is thus timely to publish 
a comprehensive article that discusses these meth- 
ods from a statistician's point of view. We shall use 
this opportunity to make a few general comments, 
largely about the field rather than about the present 
paper. 

Many papers about SVMs start off saying some- 
thing like "SVMs are great because they are based 
on statistical learning theory" (this probably includes 
some of our own writings). Moguerza and Muñoz are 
more careful and only say that SVMs appeared in 
the context of statistical learning theory. What actu- 
ally is the connection between SVMs and statistical 
learning theory? 

Historically, SVMs and their precursors were (co-) 
developed by Vladimir Vapnik, one of the fathers of 
statistical learning theory. Statistical learning the- 
ory includes an analysis of machine learning which is 
independent of the distribution underlying the data. 
However, this analysis cannot provide any a priori 
guarantee that SVMs (or any other algorithm) will 
work well on a real-world problem. So what is special 
about SVMs, if anything? 

In our view, what is special about SVMs is the 
combination of the following ingredients: first and 
foremost, the use of positive definite kernels; then 
regularization via the norm in the associated reproducing 
kernel Hilbert space; finally, the use of a convex 
loss function which is minimized by a classifier 
and not a regressor. 

The magic of kernels. Positive definite kernels and 
their feature space interpretation do provide a very 
nice way to look at a whole class of algorithms; how- 
ever, it is important to stress that they do not bring 
any statistical guarantee by themselves. The statis- 
tical guarantees available stem from the regulariza- 
tion (or learning theory) point of view. We shall re- 
turn to this point below. 

The main advantages of positive definite kernels 
are the following: 

1. They allow easy construction of a nonlinear algo- 
rithm from a linear one, often without incurring 
additional computational cost. 

2. They provide generality via the fact that they 
can be defined on nonvectorial data and do not, 
in general, require an explicit mapping to a re- 
producing kernel Hilbert space. 
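
As a minimal sketch of both points (the function names are ours and purely illustrative, not taken from the main paper), the following Python snippet checks that a degree-2 polynomial kernel equals an inner product under an explicit feature map, and evaluates a simple k-mer counting kernel directly on strings. The string kernel is only one assumed example of a kernel on nonvectorial data.

```python
import math
from collections import Counter

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel on R^2: k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def poly2_features(x):
    """Explicit feature map whose inner product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Ordinary inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def spectrum_kernel(s, t, k=2):
    """Inner product of the k-mer count vectors of two strings."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[m] * ct[m] for m in cs)

x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, z)                          # computed in input space
feature_value = dot(poly2_features(x), poly2_features(z))  # computed in feature space
string_value = spectrum_kernel("GATTA", "ATTAC")           # inputs are strings, not vectors
```

Any linear algorithm that touches the data only through inner products can be run with either kernel in place of the inner product, which is the sense in which the nonlinear version comes from the linear one.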

Historically, the first point was initially considered 
one of the major advantages of kernels and it trig- 
gered a significant number of kernel algorithms other 
than SVMs, starting with kernel principal component 
analysis (PCA). More recently, the second point 
has arguably taken over the role of the key selling 
point for kernel methods. The application of learn- 
ing algorithms to nonvectorial data has become the 
field where nowadays a lot of the action is happening 
in the machine learning world, in particular concern- 
ing applications on structured data (e.g., in biology 
or natural language processing). We are curious to 
see whether the field of statistics will also embrace 
these possibilities. 

A sober look at the geometric interpretation. The 
geometric point of view is an original way to look at 
SVMs and quite possibly the right way to come up 
with an algorithm like the SVM in the first place. 
However, it does not yield comprehensive statisti- 
cal understanding. More precisely, there is no way 
to prove that large margin separating hyperplanes 
perform better than other types of hyperplanes 
independently of the distribution of the data. 

Sure enough, the geometric point of view does pro- 
vide intuition and motivates a large number of re- 
lated algorithms, but one should not be fooled by 
geometric intuition or two-dimensional illustrations. 
The fact that data that are not linearly separable 
in input space suddenly become linearly separable 
in the so-called feature space (as depicted in Figure 
1 of the main paper) has led to misconceptions. 
Indeed, the picture seems to suggest that the kernel 
has magically placed the two clouds of points in two 
separate regions of the space, and hence uncovered 
the right decision boundary. 

The feature space often has a nonintuitive geom- 
etry. Let us take the example of the Gaussian Ra- 
dial Basis Function (RBF) kernel. The correspond- 
ing feature space is of infinite dimension and the 
points are all mapped to the positive orthant of the 
unit sphere. Any two disjoint point sets in input 
space can be separated by a hyperplane in this fea- 
ture space. 
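
The following sketch illustrates these claims on a toy example. It assumes a Gaussian kernel with unit width parameter and uses a kernel perceptron, not an SVM, as the simplest way to exhibit a separating function; the data set (an XOR layout) and all names are hypothetical.

```python
import math

def rbf(x, z, gamma=1.0):
    """Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision(x, points, labels, alpha, kernel):
    """f(x) = sum_j alpha_j * y_j * k(x_j, x)."""
    return sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, points, labels))

def kernel_perceptron(points, labels, kernel, max_epochs=1000):
    """Kernel perceptron: alpha[i] counts the mistakes made on point i."""
    alpha = [0] * len(points)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(points, labels)):
            if y * decision(x, points, labels, alpha, kernel) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return alpha

# XOR layout: no straight line separates the two classes in the input plane.
points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
labels = [1, 1, -1, -1]

self_similarities = [rbf(p, p) for p in points]  # all equal to 1: unit sphere
cross_similarity = rbf(points[0], points[1])     # strictly positive: positive orthant
alpha = kernel_perceptron(points, labels, rbf)
margins = [y * decision(x, points, labels, alpha, rbf) for x, y in zip(points, labels)]
```

All entries of `margins` come out positive, so the XOR labelling is separated in feature space. The same would hold for any other labelling of distinct points, which is why separability alone says nothing about whether the right decision boundary has been found.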

There is thus something mysterious happening in 
this space, but this space is but one way to look at 
things. We might instead directly look at the SVM 
algorithm and see that, loosely speaking, it tries to 
combine functions of the form $k(x_i, \cdot)$ using 
coefficients chosen to maximize the real-valued 
predictions $y_i f(x_i)$ on the training set. This brings us to 
the concept of margin. People usually say that max- 
imizing the margin is good for generalization. There 
are two concepts of margin to be distinguished: 

• Geometric margin (distance to the hyperplane). 
This is related to the norm of the weight vector, so 
that maximizing the margin corresponds to minimizing 
the norm (i.e., to regularization). Regularization 
can indeed lead to good generalization, 
provided the kind of smoothness enforced by the 
regularizer reflects the specifics of the problem. 

• Numerical margin [i.e., the quantity $y_i f(x_i)$ which 
appears in the hinge loss used by the standard 
SVM]. The main reason why it makes sense to 
maximize this margin is that the hinge loss is 
a convex non-increasing upper bound of the classification 
loss, so that making $y_i f(x_i)$ large will 
ensure that the hinge loss is small and thus that 
the number of misclassification errors is small. 
However, this only means that minimizing 
the empirical hinge loss might lead to minimizing 
the empirical misclassification error, but does 
not guarantee that the expected misclassification 
error will be minimized as well. 
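
The upper-bound property of the hinge loss mentioned in the second item can be checked numerically. The snippet below is a small sketch; the function names and the grid of margin values are our own choices.

```python
def hinge_loss(margin):
    """Hinge loss as a function of the numerical margin y * f(x)."""
    return max(0.0, 1.0 - margin)

def zero_one_loss(margin):
    """Classification loss: 1 for a misclassified point (margin <= 0), else 0."""
    return 1.0 if margin <= 0 else 0.0

# Grid of numerical margins from -3 to 3.
margins = [m / 10 for m in range(-30, 31)]
hinge_values = [hinge_loss(m) for m in margins]

# The hinge loss dominates the classification loss everywhere on the grid...
upper_bound_holds = all(hinge_loss(m) >= zero_one_loss(m) for m in margins)
# ...and never increases as the margin grows.
non_increasing = all(a >= b for a, b in zip(hinge_values, hinge_values[1:]))
```

Both flags evaluate to true on this grid. Note that the check concerns empirical quantities only; it says nothing about the expected misclassification error.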

These two notions are quite distinct, yet they are 
sometimes confused because they are entangled in 
the algorithms. For instance, if one minimizes the 
hinge loss over linear combinations of kernels and if 
there exists a combination such that the total hinge 
loss on the training set is zero, then this combination 
is not unique: we can multiply it by an arbitrary 
scale factor larger than one. Introducing a constraint on the 
norm of the weight vector is a natural way to remove 
this gauge freedom. This constraint is not innocent. 
It introduces a coupling between the numerical and 
the geometric margins: maximizing the geometric 
margin (in the context of an appropriate nonlinear 
kernel) leads to regularization which prevents over- 
fitting by penalizing complex functions, while max- 
imizing the numerical margin leads to minimization 
of the empirical error. Searching for a function with 
small empirical error while penalizing the complex- 
ity is the key to most reasonable learning algorithms. 
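
A one-dimensional toy example makes the coupling concrete. The sketch below assumes a linear function f(x) = w * x, hypothetical separable data, and a squared-norm penalty with an arbitrary weight of 0.1 in place of an explicit norm constraint.

```python
def total_hinge(w, xs, ys):
    """Total hinge loss of the 1-D linear function f(x) = w * x."""
    return sum(max(0.0, 1.0 - y * w * x) for x, y in zip(xs, ys))

# Hypothetical separable toy data.
xs = [1.0, 2.0, -1.5]
ys = [1, 1, -1]

# Many different scales reach zero hinge loss, so the minimizer is not unique.
zero_loss_ws = [w for w in (1.0, 2.0, 5.0, 50.0) if total_hinge(w, xs, ys) == 0.0]

def regularized_objective(w, lam=0.1):
    """Hinge loss plus a penalty on the squared norm of the weight."""
    return total_hinge(w, xs, ys) + lam * w ** 2

# With the penalty, a single scale is preferred: the smallest norm that still
# keeps every numerical margin at least 1, i.e. the largest geometric margin.
grid = [i / 10 for i in range(1, 51)]
best_w = min(grid, key=regularized_objective)
```

On this grid the penalized objective is minimized at w = 1, the smallest weight for which all numerical margins reach 1.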

Convexity and loss functions. Another attractive 
feature of positive definite kernels is that they allow 
nonlinearization of learning algorithms while preserving 
the convexity of the associated optimization 
problem. This is also one reason for the success of 
SVMs: the optimization problem is easier to handle 
than that of other algorithms such as artificial neu- 
ral networks. The introduction of SVMs with kernels 
in the machine learning community suddenly moved 
the focus from optimization algorithms (e.g., multi- 
ple variants of gradient descent) to optimization cri- 
teria. This has created significant interest in convex 
functionals (for all kinds of problems such as model 
selection, semisupervised or unsupervised learning) 
and methods of convexifying existing functionals. 

In the context of supervised learning, this search 
for convexity has led to the introduction of many 
different convex loss functions. However, something 
that has often been overlooked is the set of proper- 
ties the loss function has to satisfy so that it leads 
to a consistent algorithm. For example, in the clas- 
sification setting, a minimum requirement is that 
with sufficient data, minimizing the loss should lead 
to minimization of the misclassification error. For 
standard SVMs, the fact that the hinge loss sat- 
isfies this property was noticed relatively late (see 
reference [40] of the main paper) and, more surpris- 
ingly, in the context of multiclass classification this 
has been addressed only very recently. It has been 
proved in [1] that several variants of multiclass SVM 
do not have the required property. Of course, this 
is not to say that they perform poorly on a finite 
sample, but it is important to understand what an 
algorithm is aiming at and how it should behave as 
the sample size increases. 

Moguerza and Muñoz are indeed aware of the fact 
that the minimizer of the hinge loss is the Bayes 
classifier (or rather is a function which has the same 
sign as that of the Bayes classifier), but they later 
say that there is still work to be done to provide a 
probabilistic interpretation of the output values pro- 
duced by SVM classifiers. This is somewhat prob- 
lematic because, at least asymptotically, there is no 
possible relationship between probabilities and output 
values. [This follows from the consistency property: 
With an appropriate kernel, the values of the 
function produced by the SVM algorithm will converge 
to exactly $+1$ or $-1$ on all the points where 
$P(Y \mid X) \in\, ]0, 1/2[\; \cup\; ]1/2, 1[$, so that the value of $f(X)$ 
will have no relationship to $P(Y \mid X)$.] Hence, on a 
finite sample, if a relationship occurs, it will likely 
be by pure chance (or because the kernel happens 
to regularize exactly in the way needed for the pre- 
ferred functions to look like the conditional proba- 
bility density function). 
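
The asymptotic argument can be illustrated pointwise. In the sketch below, `eta` stands for the conditional probability that the label is +1 at a given point (a notation we introduce here), and a grid search finds the output value that minimizes the conditional expected hinge loss.

```python
def expected_hinge(f, eta):
    """Conditional expected hinge loss at a point where P(Y = +1 | X) = eta."""
    return eta * max(0.0, 1.0 - f) + (1.0 - eta) * max(0.0, 1.0 + f)

def pointwise_minimizer(eta):
    """Grid search for the output value f minimizing the expected hinge loss."""
    grid = [i / 100 for i in range(-200, 201)]
    return min(grid, key=lambda f: expected_hinge(f, eta))

# Output value preferred by the hinge loss for several conditional probabilities.
minimizers = {eta: pointwise_minimizer(eta) for eta in (0.1, 0.3, 0.7, 0.9)}
```

The minimizer is -1 for both probabilities below 1/2 and +1 for both probabilities above 1/2: the output carries the sign of the Bayes classifier but no further information about the probability itself.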

To conclude this section with a more philosoph- 
ical viewpoint, let us mention that the SVM algo- 
rithm also reinforces the belief that one should be 
concerned about the objective rather than about 
the model: what is important is not whether one 
can identify the "true" target function; rather, one 
should try to find some function, from a large class, 
which will perform well. This belief is shared by 
many researchers in the machine learning commu- 
nity, and it probably distinguishes them from "clas- 
sical" statisticians, as argued, for example, in [3]. 

Theoretical considerations. Regarding the statis- 
tical analysis of the SVM algorithm, besides the 
works cited in the paper, there are a few additional 
references that are worth mentioning; for example, 
[6] first proved universal consistency of L1-SVM with 
a Gaussian kernel, while Steinwart and Scovel [8] 
and Steinwart [7] obtained rates of convergence un- 
der various conditions. Also, more recently, the con- 
sistency of SVM has been proved by Vert and Vert 
[9] in the case where the regularization parameter is 
held fixed, but the kernel width goes to zero. This 
suggests that there is a coupling between both types 
of regularization (provided by a small norm of the 
function and a large kernel width). 



It is now clear that the VC dimension is not the 
right parameter to capture the rates of convergence, 
especially when studying real-valued function classes. 
Alternative possibilities (based on Rademacher averages) 
along with finite-sample performance bounds 
can be found, for example, in [2]. 

Progress has also been made in understanding the 
role of sparsity in SVMs. First of all, the number of 
support vectors is asymptotically linear in the sam- 
ple size if the Bayes error is nonzero. Second, on 
large data bases the number of support vectors is 
usually too large for fast testing (hence the develop- 
ment of reduced set methods which can be applied 
to nonsparse models [4, 5]). 

Why do SVMs work so well in practice? There is 
probably no theoretical answer to this question. The 
fact that they are universally consistent is surely in- 
teresting, but does not explain anything about finite 
sample performance on real-world data sets (e.g., 
the k-nearest neighbor algorithm is also universally 
consistent). The sparsity also does not explain it. 
Regularization (by the kernel width and by the func- 
tion norm) surely plays a role (by preventing overfit- 
ting) but this cannot be quantified. Indeed, in sta- 
tistical terms, one can only tell the effect of regu- 
larization on the variance but not on the bias, at 
least if one does not make specific assumptions on 
the smoothness of the target function. The only pos- 
sible answer to this question might thus be that on 
those problems where SVMs excel, the kernel that 
is used induces a regularizer that incorporates ap- 
propriate prior knowledge about the problems or, 
equivalently, it captures the right notion of similar- 
ity. In a large majority of applications, the Gaussian 
RBF kernel is used and its success simply means 
that the Euclidean distance in input space is locally 
meaningful for those problems. [Indeed, the Gaus- 
sian kernel incorporates a notion of similarity which 
is a monotonic function of the Euclidean distance. 
In this case, the SVM produces a "local rule": The 
prediction at a given point is a weighted combina- 
tion of the labels of nearby points (where the weight 
mainly depends on the distance and is adapted by 
the coefficients $\lambda_j$ which appear in equation (3.4) of 
the main paper).] 
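
The "local rule" reading can be sketched as follows. The decision function is written in the generic kernel-expansion form with the bias term omitted, and the coefficients are hypothetical values chosen for illustration rather than the output of a trained SVM.

```python
import math

def gaussian_kernel(x, z, gamma=1.0):
    """Similarity that decreases monotonically with the distance |x - z|."""
    return math.exp(-gamma * (x - z) ** 2)

def local_rule(x, train_x, train_y, coeffs, gamma=1.0):
    """f(x) = sum_j coeffs[j] * y_j * k(x_j, x); bias term omitted for simplicity."""
    return sum(c * y * gaussian_kernel(xj, x, gamma)
               for c, xj, y in zip(coeffs, train_x, train_y))

# Two small clusters on the real line with opposite labels.
train_x = [0.0, 0.2, 3.0, 3.2]
train_y = [1, 1, -1, -1]
coeffs = [1.0, 1.0, 1.0, 1.0]  # hypothetical, not obtained from a trained SVM

near_positive = local_rule(0.1, train_x, train_y, coeffs)
near_negative = local_rule(3.1, train_x, train_y, coeffs)
far_weight = gaussian_kernel(0.1, 3.0)  # contribution of a distant point
```

The prediction near each cluster is dominated by the labels of that cluster, while distant points contribute weights that are orders of magnitude smaller.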

Future directions for research. Although, as explained 
by Moguerza and Muñoz, the SVM algorithm 
in itself has several interesting merits, we think 
that what is most important about it is its impact 
on the field of machine learning and statistics. It has 
introduced new concepts and ideas that have considerably 
influenced their progress, and we expect 
that the acquired momentum will lead to further 
advances, in domains such as structured learning, 
joint kernels (mixing inputs and outputs), links to 
graphical models, and semisupervised learning, to 
name but a few. In a different direction, one could 
try to extend the notion of kernel so as to handle 
higher level similarities, such as analogies (which 
can be considered as similarities between pairs of 
examples). 

There are also several important questions that 
need to be addressed so as to bridge the gap between 
basic research and applications. For instance, there 
is no satisfactory method for choosing the parameters 
other than using cross-validation, which can 
be an obstacle in applications. Moreover, there are 
still significant computational issues arising from the 
implementation of SVM-like algorithms using non- 
linear kernels for large-scale problems. 